Abstract

A new approach called RESID is proposed in this paper for estimating the
reliability of software while allowing for imperfect debugging. Unlike earlier
approaches based on counting the number of bugs or modelling inter-failure
time gaps, RESID focuses on the probability of "bugginess" of different
parts of a program. This perspective offers an easy way to incorporate
the structure of the software under test, as well as imperfect
debugging. One main design objective behind RESID is ease of implementation
in practical scenarios.

1 Introduction 

With computer programs pervading all walks of modern life, software de- 
bugging has long been an area of active interest. There are three major 
aspects to this problem. Firstly, one needs better software development 
tools to avoid creating bugs. Secondly, one needs to be able to detect and 
correct bugs that have already crept in. The third goal is to estimate the 
reliability of a software program. Since no foolproof debugging method
is known to exist, the third goal is no less important than the first two.
Various approaches have been suggested in the literature to estimate the
reliability of a given piece of software, ranging from simple profiling
techniques (disregarding the stochastic nature of bugs) to elaborate stochastic
models (that often overlook the structure of the program). In this paper
we propose a new technique called Reliability Estimation for Soft- 
ware under Imperfect Debugging (RESID) to make the twain meet: 
a statistical method based on maximum likelihood estimation that also 
takes the structure of the program into account. The model also allows 
the possibility of imperfect debugging, where a particular chunk of code 
is allowed to contain bugs (albeit with a reduced probability) even after 
multiple debugging sessions. 

The paper is laid out as follows. In the next section we review various
techniques proposed for the problem, with a brief discussion of their mer- 
its and demerits. Section 3 presents the new method from the theoretical 
viewpoint. Suggestions for practical implementation of the method are 
given in section 4. Section 5 presents some discussion about the perfor- 
mance of the technique based on simulation. Section 6 deals with some 
variations of RESID to suit specific needs. After a brief concluding sec- 
tion some probabilistic underpinnings of the method are outlined in an 
appendix. 

2 Review of existing techniques 

Many models and approaches have been suggested in the literature to
assess software reliability. We shall briefly review a selection of these
techniques, without aspiring to comprehensiveness, which in any case is
beyond the scope of this short paper. An extensive literature review is given
in [11]. Software reliability is typically defined as the probability of
failure-free operation of a computer program in a specified environment for a
specified period of time [7]. As [2] points out, the statistical models for
software reliability come in two distinct flavours: those that deal with the time
between successive failures, and those that deal with counting bugs in a
program.

The former approach, pioneered by [5,6], makes the (somewhat un- 
realistic) assumption that inter-failure times are exponential in nature, 
and are independent of one another. Methods of this genre also make the 
assumption that a bug is always rectified when it is detected. This un- 
fortunately leaves no place for wrong or incomplete fixes, a phenomenon 
that ubiquitously plagues the software industry. A related approach is 
taken by [9, 10], where the authors try to fit time series models to the 
inter-failure times. 

The second approach uses point processes to model occurrences of 
bugs [4]. The paper [8] is a typical example, where the author employs 
Poisson processes for this purpose. One notable feature of this approach 
is that the same bug is allowed to recur. Indeed, the author suggests that 
after detection a bug should either not be removed or at least have its 
place marked, so that every subsequent pass through that position may 
be counted. This idea has been extended in [3] where the presence of 
multiple bugs is modelled as multiple Poisson processes running indepen- 
dently in parallel. The main interest there lies in estimating the number 
of processes. 

However, as noted in [11], the plethora of proposed, pedantic models 
contrasts sadly with the paucity of practically implementable ones. Due 
to resource constraints, debuggers and programmers often have to resort 
to ad hoc plans to yield quick results, often in response to the needs of 
some irritated customer demanding a fix for a particular bug. Most of 
the methods proposed in the literature are a bit too elaborate to cope 
with such real life scenarios. Some practical method based on a reason- 
able statistical model would be a useful addition to a software engineer's 
repertoire. In this paper we seek to propose such a method. 

3 RESID



One of the trickiest aspects of quantifying software reliability is to come up
with a quantitative definition of a bug. Often a single mistake in a piece of
software causes multiple branches of a system to fail. Techniques based only on
the occurrences of failures often count each failed branch as a separate 
bug, while from the viewpoint of software management these should be 
considered as a single bug. We circumvent this inherent ambiguity in the 
definition of a bug by looking instead at the concept of "bugginess" as 
follows. 

We can consider a piece of code as a flowchart with branches and
loops. Control may flow down different branches depending on the initial
data. However, every program consists of chunks, where a chunk is defined as
a sequence of consecutive instructions without any embedded branching
in it. Thus, once control enters a chunk it must either flow through the
chunk along a unique linear path, or must crash the program (possibly
because of a bug in some earlier chunk).

We shall assume that each chunk has some probability p of being
"buggy". More specifically, p is the chance that we encounter some bug
in that chunk while running the software with random data. Thus,
strictly speaking, randomness enters not only due to the inadvertent
errors of the programmers, but also due to the choice of the input data.
We shall also assume that the event that one chunk contains a bug is 
independent of another chunk containing a bug. This is not as imprac- 
tical as it may sound, as bugs are born of inadvertent mistakes on part 
of the programmers, and not as a result of the control structure relating 
the chunks. Our assumption of independence does not preclude the pos- 
sibility that a bug in one chunk may wreak havoc in a subsequent chunk. 
Mathematically inclined readers not content with this explanation may 
see the appendix for a more rigorous presentation of this idea. 

To assess the reliability of a piece of software we start with the struc- 
ture of the program in terms of the chunks. During the debugging phase 
the program is run multiple times, each time with independent initial 
data. For each run we record the following information. 

1. whether the run has terminated correctly or not, 

2. if the run indicates the presence of a bug, then which chunk contains 
the bug, 

3. which chunks have been executed and how many times. 

The first two pieces of information are available from any standard 
debugging session. In order to collect the frequency of execution of the 
chunks one has to embed a logging command in each chunk. Such an 
embedding can be easily achieved automatically using software tools. We 
assume that a bug is fixed (possibly imperfectly) once it is detected. We 
shall later point out two variants of the same approach to cope with 
situations where a bug cannot be removed, or where multiple bugs are 
identified in a single run. 

Notice that even though we feed independent initial values in the pro- 
gram each time, the collected data are not iid in nature, since the under- 
lying software changes with each debugging. 

We stochastically model the software debugging mechanism as follows.
Initially each chunk is believed to be buggy with probability p.

Every time a chunk is "debugged" we assume that the probability of its
still remaining buggy is scaled down by a known factor $a \in (0, 1)$, which
measures the debugging inefficiency. Thus, after detection and correction
of its first k bugs, a chunk has probability $p a^{k}$ of remaining buggy. A
value of a close to 0 implies efficient debugging, while a value close to 1
implies the opposite.
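
The decay rule just described can be illustrated with a minimal Python sketch. The function name `chunk_bug_probability` and the sample values p = 0.5, a = 0.9 are illustrative assumptions, not part of the paper.

```python
def chunk_bug_probability(p: float, a: float, k: int) -> float:
    """Probability that a chunk is still buggy after k debugging sessions.

    p: initial probability of bugginess of a chunk, in (0, 1)
    a: debugging inefficiency factor, in (0, 1)
    k: number of times the chunk has already been debugged
    """
    if not (0.0 < p < 1.0 and 0.0 < a < 1.0) or k < 0:
        raise ValueError("need 0 < p < 1, 0 < a < 1 and k >= 0")
    return p * a ** k


# Unreliability scores of one chunk after 0, 1, 2 and 3 debugging sessions
# (sample values chosen only for illustration).
scores = [chunk_bug_probability(0.5, 0.9, k) for k in range(4)]
```

With a close to 1 the scores shrink slowly (inefficient debugging); with a close to 0 a single debugging session almost removes the chance of a remaining bug.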

We shall consider p as a measure of the reliability (or, rather, the lack
thereof) of the overall software. Each chunk gets its own unreliability
score $p a^{k}$. It is not difficult to come up with a color coding scheme to
depict the unreliability scores of the different chunks diagrammatically,
e.g., using a UML diagram.

If p is sufficiently small we might consider the software as reliable 
enough. If, however, p is large, then we may like to focus our debugging 
efforts on the chunks with higher unreliability scores. 

We shall employ maximum likelihood estimation to estimate p. Writing 
down the likelihood function, however, is a bit tricky here, as we have to 
take the structure of the software into account. The procedure is best 
explained with a simple example. 

Consider a simple program with control flow as shown in Fig 1. It has
3 chunks, each labelled with an arrow and a number. Control starts in
chunk 1, then comes to an if-clause, and branches out into chunk 2 or 3.




Fig 1: A 3-chunk program with an if-clause

The program is executed 5 times with the following results: 

1. 1; bug in 1 (i.e., the program crashed in chunk 1 due to a bug in 
that chunk) 

2. 1,2; bug in 2 

3. 1,3; no bug 

4. 1, 2; bug in 1 (i.e., the program continued up to chunk 2, but a bug 
was found in chunk 1) 

5. 1,3; bug in 3 

We assume that the buggy chunk is (imperfectly) debugged after each 
unsuccessful run. Also, the chunks visited after passing through a buggy 
chunk produce unreliable results, and so are to be ignored. For example, 
we shall truncate the record for the fourth run above to 

1; bug in 1. 

The probability of bugginess of each chunk after each run (that is, after
any debugging triggered by that run) is listed below, along with
the likelihood value for each run:

Run   Chunk 1   Chunk 2   Chunk 3   Likelihood of the run
1     $pa$      $p$       $p$       $P(1;\ \text{bug}) = p$
2     $pa$      $pa$      $p$       $P(1,2;\ \text{bug}) = (1 - pa)p$
3     $pa$      $pa$      $p$       $P(1,3;\ \text{no bug}) = (1 - pa)(1 - p)$
4     $pa^2$    $pa$      $p$       $P(1;\ \text{bug}) = pa$
5     $pa^2$    $pa$      $pa$      $P(1,3;\ \text{bug}) = (1 - pa^2)p$


Taking the product of the last column we get the likelihood of the
entire data set as

\[ L(p) = \text{constant} \times p^{4} (1 - p)(1 - pa)^{2} (1 - pa^{2}). \]
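
As a sanity check on this worked example, the following Python sketch multiplies the per-run likelihoods of the five (truncated) runs and compares the result with the closed form, in which the constant equals a. The function names and the encoding of a run as a `(path, bug_chunk)` pair are assumptions made for illustration.

```python
def run_likelihood(path, bug_chunk, debug_count, p, a):
    """Likelihood of one run; updates debug_count when a bug is found."""
    likelihood = 1.0
    for chunk in path:
        q = p * a ** debug_count[chunk]   # current bugginess of this chunk
        if chunk == bug_chunk:
            likelihood *= q               # the chunk failed in this run
            debug_count[chunk] += 1       # ... and is (imperfectly) debugged
            break                         # later chunks are ignored
        likelihood *= 1.0 - q             # the chunk worked in this run
    return likelihood


def example_likelihood(p, a):
    """Product of the likelihoods of the five runs of the 3-chunk example."""
    runs = [([1], 1), ([1, 2], 2), ([1, 3], None), ([1], 1), ([1, 3], 3)]
    debug_count = {1: 0, 2: 0, 3: 0}
    total = 1.0
    for path, bug in runs:
        total *= run_likelihood(path, bug, debug_count, p, a)
    return total


def closed_form(p, a):
    """a * p^4 (1 - p)(1 - pa)^2 (1 - pa^2)."""
    return a * p**4 * (1 - p) * (1 - p * a) ** 2 * (1 - p * a**2)
```

For any p and a in (0, 1), `example_likelihood(p, a)` and `closed_form(p, a)` agree up to floating-point rounding.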

It is not hard to see that this scheme can be generalized to any branching 
structure. However, the situation becomes somewhat different in presence 
of loops. A bug inside a loop may not be triggered during the very first 
pass through the loop. In practice it is often difficult to keep track of the 
exact pass when a bug inside a loop is triggered for the first time. So if 
the program halts due to a buggy chunk inside a loop, all that we can 
be sure of is that the chunks leading to it have worked correctly at least 
once. We have no way of knowing if the bug has already been triggered 
before subsequent passes through the loop or not. We explain this idea 
with an example. 

Consider the loop structure shown in Fig 2. 




Fig 2: A 3-chunk program with a loop

Suppose that a run through this program produces the following record:

1,2,2,2,3; bug in 2. 

As should be obvious to anybody with even moderate debugging ex- 
perience, it is hard to detect which pass(es) through chunk 2 has (have) 
triggered the bug. As a result all that we can be sure of is that chunk 1 
has worked fine in this run, with at least one pass of chunk 2 failing. Once 
a bug is triggered the subsequent chunks cannot be reliably debugged. So 
we shall truncate the record to 

1,2; bug in 2. 
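
Both truncation rules (for branches and for loops) amount to cutting the logged record at the first occurrence of the buggy chunk. A minimal Python sketch follows; the name `truncate_record` and the list-of-integers encoding of a record are illustrative assumptions.

```python
from typing import List, Optional, Tuple


def truncate_record(
    path: List[int], bug_chunk: Optional[int]
) -> Tuple[List[int], Optional[int]]:
    """Keep the logged path up to and including the first visit to the buggy chunk.

    A clean run (bug_chunk is None) is returned unchanged.
    """
    if bug_chunk is None:
        return list(path), None
    if bug_chunk not in path:
        raise ValueError("the buggy chunk does not appear in the logged path")
    first = path.index(bug_chunk)          # first pass through the buggy chunk
    return path[: first + 1], bug_chunk


loop_example = truncate_record([1, 2, 2, 2, 3], 2)   # the loop record above
```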

As can be seen easily, the likelihood under this model is of the form

\[ L(p) \propto p^{m} \prod_{i=0}^{k} (1 - p a^{i})^{n_i}. \]

Here k is the maximum number of debugging sessions for any chunk,
m is the number of bugs detected and removed, and $n_i$ (for $i = 0, 1, \ldots$) is
the number of perfect runs for chunks with exactly i debugging attempts.
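
The statistics m and n_i can be accumulated in one chronological pass over the run records. The Python sketch below is one possible reading of the definitions above: it assumes that a chunk visited several times in a clean stretch of a run (e.g., inside a loop) counts as a single perfect run for that chunk. The function name and record encoding are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Optional, Tuple


def sufficient_statistics(
    records: Iterable[Tuple[List[int], Optional[int]]]
) -> Tuple[int, Dict[int, int]]:
    """Return (m, n) where n[i] is the number of perfect chunk runs at debug level i.

    records: (logged path, buggy chunk or None) pairs in chronological order.
    """
    debug_count = defaultdict(int)   # debugging sessions so far, per chunk
    n = Counter()                    # n[i]: perfect runs with i earlier debuggings
    m = 0                            # bugs detected and removed
    for path, bug in records:
        seen = set()
        for chunk in path:
            if chunk == bug:
                m += 1
                debug_count[chunk] += 1
                break                # truncate: ignore everything after the bug
            if chunk not in seen:    # assumption: one trial per chunk per run
                seen.add(chunk)
                n[debug_count[chunk]] += 1
    return m, dict(sorted(n.items()))


# The five runs of the 3-chunk example (run 4 given untruncated).
example_runs = [([1], 1), ([1, 2], 2), ([1, 3], None), ([1, 2], 1), ([1, 3], 3)]
m, n = sufficient_statistics(example_runs)
```

For the 3-chunk example this gives m = 4, n_0 = 1, n_1 = 2 and n_2 = 1, which matches the exponents in the likelihood of that example.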

The log-likelihood is (up to an additive constant)

\[ \ell(p) = m \log p + \sum_{i=0}^{k} n_i \log(1 - p a^{i}). \]

The following fact is useful for numerically maximising this function
for $p \in (0, 1)$.

Lemma: For any program and any debugging outcomes, the model has
a strictly concave log-likelihood function. ///

Proof: The functions $\log p$ and $\log(1 - cp)$ are strictly concave for any
constant $c > 0$ (here $c = a^{i}$), and a linear combination of strictly concave
functions with positive coefficients is again strictly concave. ///
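
The concavity can also be checked directly by differentiating twice. The short derivation below assumes the log-likelihood has the form $\ell(p) = m \log p + \sum_{i=0}^{k} n_i \log(1 - p a^{i})$ with $m, n_i \ge 0$ not all zero.

```latex
\begin{align*}
\ell'(p)  &= \frac{m}{p} - \sum_{i=0}^{k} \frac{n_i a^{i}}{1 - p a^{i}}, \\
\ell''(p) &= -\frac{m}{p^{2}} - \sum_{i=0}^{k} \frac{n_i a^{2i}}{(1 - p a^{i})^{2}} < 0
\qquad \text{for all } p \in (0, 1).
\end{align*}
```

Every term of the second derivative is non-positive, and at least one is strictly negative as soon as some bug or some perfect run has been recorded.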

This fact implies the uniqueness of the MLE $\hat{p}$, if it exists. Unfortunately,
the MLE may not always exist. But it does exist under the following fairly
mild condition.

Lemma: If $m, n_0 > 0$ then $\ell(p)$ has a unique maximum over (0, 1). ///

Proof: This is because

\[ \lim_{p \to 0+} \frac{1}{p} = \infty, \qquad \lim_{p \to 0+} \frac{1}{1 - p a^{i}} \in \mathbb{R} \ \text{ for } i = 0, 1, \ldots \]

Also

\[ \lim_{p \to 1-} \frac{1}{p} \in \mathbb{R}, \qquad \lim_{p \to 1-} \frac{1}{1 - p} = \infty, \qquad \lim_{p \to 1-} \frac{1}{1 - p a^{i}} \in \mathbb{R} \ \text{ for } i = 1, 2, \ldots \]

So if $m, n_0 > 0$, we have

\[ \ell'(0+) > 0 \quad \text{and} \quad \ell'(1-) < 0. \]

Strict concavity from the last lemma now clinches the argument. ///

The condition of this lemma has a simple interpretation: $m > 0$ means
at least one bug is encountered somewhere during some run of the program.
The condition $n_0 > 0$ means at least one chunk has worked properly
in the very first attempt. It is easy to see that the probability of both
these happening goes to 1 as the number of chunks goes to infinity.

Incidentally, it may be shown without much additional effort that the
condition $m > 0$ is necessary for the existence of the MLE. However, the
condition $n_0 > 0$ is only sufficient. In fact, this is one of a general class of
sufficient conditions of the form $n_i / m > a^{-i} - 1$.

One may now easily apply numerical methods like Newton-Raphson
to solve

\[ \ell'(p) = 0, \]

or

\[ \frac{m}{p} - \sum_{i=0}^{k} \frac{n_i a^{i}}{1 - p a^{i}} = 0. \]



However, here we can avoid computing the second derivative needed for
the Newton-Raphson iteration by using the bisection method. By virtue of the
last lemma we can perform the bisection method over the interval $[\epsilon, 1 - \epsilon]$ for
some suitably small $\epsilon > 0$.
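
A minimal Python sketch of this bisection step is given below. It assumes the score function $\ell'(p) = m/p - \sum_i n_i a^{i} / (1 - p a^{i})$; the names `score` and `mle_bisection` and the default values of `eps` and `tol` are illustrative choices, not prescribed by the paper.

```python
from typing import Dict


def score(p: float, m: int, n: Dict[int, int], a: float) -> float:
    """Derivative of the log-likelihood at p; n maps debug level i to n_i."""
    return m / p - sum(n_i * a**i / (1.0 - p * a**i) for i, n_i in n.items())


def mle_bisection(m: int, n: Dict[int, int], a: float,
                  eps: float = 1e-9, tol: float = 1e-12) -> float:
    """Find the root of the score on [eps, 1 - eps] by bisection."""
    lo, hi = eps, 1.0 - eps
    if score(lo, m, n, a) <= 0 or score(hi, m, n, a) >= 0:
        raise ValueError("no sign change on [eps, 1 - eps]; "
                         "the MLE may not exist in (0, 1)")
    # The score is strictly decreasing (strict concavity), so the root is unique.
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if score(mid, m, n, a) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# One bug and one perfect first-attempt run: l(p) = log p + log(1 - p).
p_hat = mle_bisection(m=1, n={0: 1}, a=0.9)
```

In this small case the maximiser is p = 1/2 by symmetry, which gives a convenient check on the routine.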



4 Implementation

The main aim of this paper is to propose a systematic debugging technique
that can be easily integrated with existing methods. Our exposition so far
has been primarily theoretical. This section outlines how our approach
fits into a typical debugging session, where input data may come from a
customized design or user feedback, or may simply be generated randomly.

The first step is to identify the chunks in the software. This can be
achieved easily by simple lexical analyzers and parsers (like those
generated by flex and bison [1]). Such preprocessing steps are common in
many program validation scenarios. In our case the preprocessing step
creates a database of the chunks, associating each chunk with its file
name and the first and last line numbers. It also embeds a data logging
command at the start of each chunk, such that the chunk number is
recorded in a log file the moment control enters that chunk.

After this simple preprocessing is over the actual debugging starts,
which consists of repeated runs, each time with fresh initial data. After
each run the programmer checks if the output is OK or not. If it is, then
this fact is recorded. If something has gone wrong then the programmer
debugs the program, and identifies the line(s) that required correction.
Thus, for each run we get the logged record of visited chunks, as well as
the location of the bug, if any.

The resulting data set is now ready to be analyzed with our approach.
First, we identify the chunk containing the buggy line(s). We shall discuss
later the scenario where multiple chunks are corrected in a single run.
Next we identify the first occurrence of this chunk in the logged record,
and discard everything after this occurrence. This is necessary because
once a buggy chunk is visited, all subsequent steps are unreliable.

Next, we extract the statistics m, k and the $n_i$'s from the accumulated
records, and use these to compute and maximize $\ell(p)$ for $p \in (0, 1)$.

It should be noted that all the steps can be easily automated, and
hardly cause any disruption to the usual work flow of the programmer in
charge of debugging. This seamless integration with the existing habits
of programmers is a major strong point of RESID.

5 Results

The proper evaluation of RESID may only be done in an industrial setup
where a large, complex piece of software is actually being debugged. In this
paper we present the results using a simulated toy example. We start
with a C program, and simulate a bug in each chunk with probability $p a^{r}$,
where p is some chosen value of the parameter of interest, a = 0.9 (chosen
arbitrarily) measures debugging inefficiency, and r is the number of times
this chunk has already been debugged.

The first example is a simple program consisting of just 4 chunks as
shown in the flowchart in Fig 3. Each rhombus represents an if-clause,
and the rounded rectangle represents a loop that extends up to the circular
connector. In the simulated runs we take each branch in an if-clause
with equal probability. Also the loop is run for a random number of steps
generated uniformly from {1, ..., 100}.
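
A simulation of this kind can be sketched in Python as follows. The exact control flow of the flowchart is not reproduced in the text, so the layout used here (chunk 1, then an if-clause choosing chunk 2 or 3, then a loop over chunk 4) is purely a hypothetical stand-in. The sketch also assumes one bug trial per chunk per run, and it is written in Python rather than the C used for the paper's experiments.

```python
import random
from typing import List, Optional, Tuple

Record = Tuple[List[int], Optional[int]]


def simulate_session(p: float, a: float, n_runs: int, seed: int = 0) -> List[Record]:
    """Simulate n_runs debugging runs; each record is (logged path, buggy chunk)."""
    rng = random.Random(seed)
    debug_count = {1: 0, 2: 0, 3: 0, 4: 0}    # r for each chunk
    records: List[Record] = []
    for _ in range(n_runs):
        # Hypothetical control flow (assumption, not taken from the paper).
        branch = rng.choice([2, 3])           # each branch with equal probability
        passes = rng.randint(1, 100)          # loop length, uniform on {1, ..., 100}
        path = [1, branch] + [4] * passes
        logged: List[int] = []
        tested = set()
        bug = None
        for chunk in path:
            logged.append(chunk)
            if chunk in tested:               # assumption: one trial per chunk per run
                continue
            tested.add(chunk)
            if rng.random() < p * a ** debug_count[chunk]:
                bug = chunk                   # bug triggered with probability p * a^r
                debug_count[chunk] += 1       # imperfect debugging of this chunk
                break                         # the record stops at the buggy chunk
        records.append((logged, bug))
    return records


records = simulate_session(p=0.4, a=0.9, n_runs=100)
```

Each record is already truncated at the buggy chunk, so the output can be fed directly to whatever routine accumulates m and the n_i.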



The debugging session is run 100 times each with the values p = 
0.2, 0.4, 0.6 and 0.8. The log-likelihood functions are shown in Fig 4. 




Fig 3: A simple flowchart 

Fig 4: Simulated log-likelihood functions



Each time the peak is very near the actual value of p (shown with a 
vertical line). 

In order to estimate the variances and also to judge the effect of a,
we apply 50-run simulations 100 times for three different values p =
0.3, 0.6, 0.9 and three different values of a = 0.3, 0.6, 0.9. The resulting
mean MLE's are as follows.

The corresponding variances are

6 Variants

The main aim of this paper is to present a practically applicable software
debugging procedure. In view of practical implementation the main idea
presented in the last section may need to be changed to some extent. We 
discuss some such variations in this section. 



6.1 Chunk-specific bug probabilities 



The assumption that each chunk has the same a priori chance of being
buggy may be stretching imagination a bit too far. After all, a chunk
consisting of just a few lines of initialization is a lot less likely to spring a
bug than a chunk containing many lines of complex code. Yet allowing
each chunk to have its own p will cause an explosion in the number of
parameters. A reasonable balance may be obtained by positing that
the chance of bugginess of a chunk is related to the number of lines in it
as follows:

\[ P(\text{a chunk is buggy}) = 1 - (1 - p)^{K}, \qquad (*) \]

where K is the number of lines in the chunk, and $p \in (0, 1)$. This formula
may be motivated by considering each line to have probability p of
springing a bug, and assuming that the event that one line is buggy is
independent of another line being buggy. Then (*) gives the chance of
the chunk containing at least a single bug.
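
Formula (*) is easy to evaluate; the short Python sketch below shows how the chance of bugginess grows with chunk length. The function name and the sample values are illustrative assumptions.

```python
def size_adjusted_bug_probability(p: float, n_lines: int) -> float:
    """Chance that a chunk of n_lines lines contains at least one bug, as in (*).

    p: per-line probability of springing a bug, in (0, 1)
    n_lines: number of lines K in the chunk
    """
    if not 0.0 < p < 1.0 or n_lines < 1:
        raise ValueError("need 0 < p < 1 and n_lines >= 1")
    return 1.0 - (1.0 - p) ** n_lines


short_chunk = size_adjusted_bug_probability(0.01, 5)    # a few lines of initialization
long_chunk = size_adjusted_bug_probability(0.01, 50)    # many lines of complex code
```

For a one-line chunk the formula reduces to p itself, so the basic model is recovered when every chunk has K = 1.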

In this case the likelihood function is of the form

\[ L(p) \propto p^{m} \prod_{i=0}^{k} \prod_{j} \bigl(1 - (1 - (1 - p)^{K_{ij}})\, a^{i}\bigr). \]

As before this is just the form. Trying to interpret the quantities like $K_{ij}$
would not lead to any easy way to compute it for a given data set. One 
must rely on automated tools for its evaluation. Complicated as it is, the 
log-likelihood function still obeys the lemmas given earlier. 

6.2 Classification of chunks 

The inefficiency of debugging may depend on the type of code a chunk
contains. A chunk involving numerical analysis is typically harder to
debug than one consisting of some initialization commands. Accordingly it
may be possible to classify all the chunks into a small number of broad
categories, and assign different debugging inefficiency factors to these.

6.3 Multi-chunk debugging 

After a program terminates in an undesirable fashion (e.g., crashes or
gives wrong output), a programmer has to carefully go through the code 
to identify the corresponding bug. In some rare situations the bug may 
not be confined in a single chunk, as we have assumed so far. But our 
approach can be easily adapted to such a scenario. The programmer is to 
just record the lines that needed correction. All the corresponding chunks 
are then marked as buggy, and the logged record for the run is truncated 
up to and including the first occurrence of any of these chunks. Also the 
chance of these chunks still remaining buggy is updated by scaling down 
the current probabilities with a factor a. 

The corresponding change to the log-likelihood is notationally cum- 
bersome, but easily effected in a computer. 
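
The bookkeeping for the multi-chunk case can be sketched in Python as follows. The helper names and the dictionary of per-chunk debugging counts are hypothetical; incrementing a chunk's count by one corresponds to scaling its current probability of bugginess by the factor a.

```python
from typing import Dict, Iterable, List


def truncate_multi(path: List[int], buggy_chunks: Iterable[int]) -> List[int]:
    """Cut the logged path at the first occurrence of any corrected chunk."""
    buggy = set(buggy_chunks)
    if not buggy:
        return list(path)                      # clean run: nothing to truncate
    for idx, chunk in enumerate(path):
        if chunk in buggy:
            return path[: idx + 1]             # up to and including that chunk
    raise ValueError("none of the corrected chunks appears in the logged path")


def update_debug_counts(debug_count: Dict[int, int],
                        buggy_chunks: Iterable[int]) -> None:
    """Record one more debugging session for every corrected chunk."""
    for chunk in set(buggy_chunks):
        debug_count[chunk] = debug_count.get(chunk, 0) + 1


counts = {3: 1}
truncated = truncate_multi([1, 2, 3, 2, 4], [3, 4])   # chunks 3 and 4 were corrected
update_debug_counts(counts, [3, 4])
```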

6.4 Bug detected, but not removed

Since software debugging is often done under resource constraints, not all 
bugs whose presence can be detected can be removed. This is especially 
true about bugs that are deemed more "esoteric" in nature. In such a case 
we can still apply our approach, but we do not scale down the probability 
for the chunk. 

7 Conclusion 

In this paper we have proposed a new software reliability technique. The 
technique seeks to integrate ad hoc practical debugging methods with sta- 
tistical modelling. We have mentioned how a classical debugging session 
can be easily cast into the new approach with the use of the software tools. 

The author would like to thank Prof Anup Dewanji for introducing him to
this area of research. 

8 Appendix 

Throughout the paper we have made the assumption that the existence of 
bug(s) in a chunk/line is independent of the existence of a bug in another 
chunk/line. We shall try to justify this heuristic statement here in terms 
of rigorous probability argument. Notice that when we say that "a chunk 
is found buggy" we actually mean two things: 

1. that the chunk contains a bug, 

2. that this bug is triggered by the data we are using. 

So there are two sources of randomness. The former is introduced
by the programmer, the latter by the user. We shall accordingly employ
two probability spaces $(\mathcal{X}, \mathcal{F}_X, P_X)$ and $(\mathcal{Y}, \mathcal{F}_Y, P_Y)$ for the programmer
and user, respectively. We may think of $\mathcal{X}$ as all ways of creating bugs
available to the programmer. Similarly, $\mathcal{Y}$ denotes the set of all possible
user inputs. We make the following assumptions.

Assumptions:

1. The user and the programmer behave independently.

2. Let $A_1, A_2 \subseteq \mathcal{X}$ be the events that there are bugs in two (disjoint)
chunks. Then $A_1, A_2$ are mutually independent.

3. Let $B_1, B_2 \subseteq \mathcal{Y}$ be the events that the user chooses an input that
triggers any two distinct bugs. Then $B_1, B_2$ are independent.

Thanks to the first assumption we are working in the product space

\[ (\mathcal{X} \times \mathcal{Y},\ \mathcal{F}_X \otimes \mathcal{F}_Y,\ P_{XY} = P_X \otimes P_Y). \]

If a user encounters a bug in a particular chunk, this event is

\[ A \times B \subseteq \mathcal{X} \times \mathcal{Y}, \]

where $A \subseteq \mathcal{X}$ is the event that the programmer has left a bug in that
chunk, and $B \subseteq \mathcal{Y}$ denotes the event that the user happens to have chosen
an input to trigger it.

Thus when we say that the chance of finding a given chunk to be buggy
is p we actually mean $P_{XY}(A \times B) = p$, and not $P_X(A) = p$.

Now let us fix any two distinct chunks and accordingly define $A_i \in \mathcal{F}_X$
as the event of chunk i containing a bug. Also let $B_i \in \mathcal{F}_Y$ be the event
that the user chooses an input value that detects a bug (if any) in chunk
i.

Let $C_i \in \mathcal{F}_{XY}$ be the event that the user actually encounters a bug in
chunk i. Then $C_1, C_2$ are independent in the product space, because

\begin{align*}
P_{XY}(C_1 \cap C_2) &= P_{XY}((A_1 \times B_1) \cap (A_2 \times B_2)) \\
&= P_{XY}((A_1 \cap A_2) \times (B_1 \cap B_2)) \\
&= P_X(A_1 \cap A_2)\, P_Y(B_1 \cap B_2) \\
&= P_X(A_1)\, P_X(A_2)\, P_Y(B_1)\, P_Y(B_2) \\
&= P_{XY}(A_1 \times B_1)\, P_{XY}(A_2 \times B_2) \\
&= P_{XY}(C_1)\, P_{XY}(C_2).
\end{align*}